Telegram Group & Telegram Channel
https://arxiv.org/abs/2305.18290 #llm #ai

今天深入学习了 DPO,再次感叹扎实的数学功底对 AI/ML Research 的重要性……

原始的 RLHF 是用 pairwise human preference data(A 和 B 哪个更好)去训练一个 reward model,然后用 RL 来训练主 policy model,objective 是 minimize negative log likelihood + regularization(比如 PPO 就是通过新旧 policy 之间的 KL Divergence 来做 regularization)。这样的缺点在于 RL 是出了名的难搞,而且还需要一个 critic model 来预测 reward,使得整个系统的复杂性很高。

DPO 的思路是,观察到 RLHF 的 objective 本质上是 minimize loss over (latent) reward function,通过一番 reparameterization 等数学推导,重新设计了一个 minimize loss over policy 的 objective,绕过了中间这个 reward model,让 gradient update 直接增加 policy model 生成 winner response 的概率并降低 loser response 的概率,大幅简化了流程。

拓展阅读:
- KTO: 更进一步,不需要 pairwise comparison,只用对 individual example 的 upvote/downvote 也可以学习到 preference。
- IPO: 解决 DPO 容易 overfit 的问题。



tg-me.com/LinghaoCh/938
Create:
Last Update:

https://arxiv.org/abs/2305.18290 #llm #ai

今天深入学习了 DPO,再次感叹扎实的数学功底对 AI/ML Research 的重要性……

原始的 RLHF 是用 pairwise human preference data(A 和 B 哪个更好)去训练一个 reward model,然后用 RL 来训练主 policy model,objective 是 minimize negative log likelihood + regularization(比如 PPO 就是通过新旧 policy 之间的 KL Divergence 来做 regularization)。这样的缺点在于 RL 是出了名的难搞,而且还需要一个 critic model 来预测 reward,使得整个系统的复杂性很高。

DPO 的思路是,观察到 RLHF 的 objective 本质上是 minimize loss over (latent) reward function,通过一番 reparameterization 等数学推导,重新设计了一个 minimize loss over policy 的 objective,绕过了中间这个 reward model,让 gradient update 直接增加 policy model 生成 winner response 的概率并降低 loser response 的概率,大幅简化了流程。

拓展阅读:
- KTO: 更进一步,不需要 pairwise comparison,只用对 individual example 的 upvote/downvote 也可以学习到 preference。
- IPO: 解决 DPO 容易 overfit 的问题。

BY Parallel Experiments




Share with your friend now:
tg-me.com/LinghaoCh/938

View MORE
Open in Telegram


Parallel Experiments Telegram | DID YOU KNOW?

Date: |

The SSE was the first modern stock exchange to open in China, with trading commencing in 1990. It has now grown to become the largest stock exchange in Asia and the third-largest in the world by market capitalization, which stood at RMB 50.6 trillion (US$7.8 trillion) as of September 2021. Stocks (both A-shares and B-shares), bonds, funds, and derivatives are traded on the exchange. The SEE has two trading boards, the Main Board and the Science and Technology Innovation Board, the latter more commonly known as the STAR Market. The Main Board mainly hosts large, well-established Chinese companies and lists both A-shares and B-shares.

Telegram Gives Up On Crypto Blockchain Project

Durov said on his Telegram channel today that the two and a half year blockchain and crypto project has been put to sleep. Ironically, after leaving Russia because the government wanted his encryption keys to his social media firm, Durov’s cryptocurrency idea lost steam because of a U.S. court. “The technology we created allowed for an open, free, decentralized exchange of value and ideas. TON had the potential to revolutionize how people store and transfer funds and information,” he wrote on his channel. “Unfortunately, a U.S. court stopped TON from happening.”

Parallel Experiments from vn


Telegram Parallel Experiments
FROM USA